{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6f4266e5-047d-4e49-a79a-85c99759e1c6",
   "metadata": {},
   "source": [
    "## PII Redactor Example Notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12bab93e-70ee-4956-821a-2080db9c11a3",
   "metadata": {},
   "source": [
    "\n",
    "**Author**: Pooja Holkar ,\n",
    "**email**:poholkar@in.ibm.com\n",
    "\n",
    "Click link to open notebook in google colab:  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/data-prep-kit/data-prep-kit/blob/dev/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb)\n",
    "\n",
    "\n",
    "### What is a PII Redactor?\n",
    "\n",
    "A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n",
    "\n",
    "Names\n",
    "Email addresses\n",
    "Phone numbers\n",
    "Addresses\n",
    "Financial details (e.g., credit card numbers)\n",
    "\n",
    "### Overview of the use case\n",
    "In this usecase, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n",
    "\n",
    " **Workflow Overview**\n",
    "\n",
    "- **Extracting and Converting Text:** The content of the invoice, originally in PDF format, is processed using the pdf2parquet transform to extract the text and convert it into a structured Parquet file, enabling easier handling and downstream processing.\n",
    "\n",
    "- **Redacting Sensitive Information:** The generated Parquet file serves as the input for the pii_redactor_transform. This step scans the invoice data for personally identifiable information (PII) and applies masking techniques to redact any sensitive content, ensuring data privacy and compliance.\n",
    "\n",
    "- **Creating the Final Output:** After the redaction process, a new output Parquet file is generated in **output-redacted** folder, containing the same structured data as the original but with all sensitive details securely masked to prevent unauthorized access or exposure.\n",
    "\n",
    "\n",
    "### Why is PII Redaction Important?\n",
    "\n",
    " **Data Privacy Compliance**: Adheres to regulations like GDPR, HIPAA, or CCPA that mandate safeguarding customer information.\n",
    "\n",
    " **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.\n",
    "\n",
    " **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10a3ed52-68aa-45bb-a2d3-1b08e5ad2ab2",
   "metadata": {},
   "source": [
    "### Pre-req: Install data-prep-kit dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fd7131c9-af67-4ac5-9d3d-a90ab7d11898",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture logpip --no-stderr\n",
    "!pip install data-prep-toolkit==0.2.2\n",
    "!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'\n",
    "!pip install 'data-prep-toolkit-transforms[pdf2parquet]==0.2.2'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3cfa3c3-6619-4cd4-887f-6d269f896210",
   "metadata": {},
   "source": [
    "###  Figure out Runtime Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "453b6ba9-253a-47c7-9b6c-999d01734114",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NOT in Colab\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "if os.getenv(\"COLAB_RELEASE_TAG\"):\n",
    "   print(\"Running in Colab\")\n",
    "   RUNNING_IN_COLAB = True\n",
    "else:\n",
    "   print(\"NOT in Colab\")\n",
    "   RUNNING_IN_COLAB = False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cab116e-b772-4e4b-ab2e-d22593b26a7a",
   "metadata": {},
   "source": [
    "### Download Data if running on Google Colab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7e6d4c4d-990a-4938-b39a-ba761a638931",
   "metadata": {},
   "outputs": [],
   "source": [
    "if RUNNING_IN_COLAB:\n",
    "  !mkdir -p 'input-data'\n",
    "  !wget -O 'input-data/Invoice.pdf' 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/input-data/Invoice.pdf'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d6f9a32-6ae4-4bf2-b939-70b579f18185",
   "metadata": {},
   "source": [
    "## Step 1: Configuration"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e34249a4-cf6e-4879-a8ed-3092653874a8",
   "metadata": {},
   "source": [
    "### Import necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b2186d9f-f09f-4831-be8b-ab04bfe63e41",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast\n",
    "import os\n",
    "import sys\n",
    "from data_processing.runtime.pure_python import PythonTransformLauncher\n",
    "from data_processing.utils import ParamsUtils\n",
    "from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7c6ecec-3744-41da-839d-14fb4b692ff8",
   "metadata": {},
   "source": [
    "### Create input/outpur directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "84e89404-3c85-4f72-92b9-bb209642735c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create parameters\n",
    "input_folder = os.path.join(\"input-data\")\n",
    "output_folder = os.path.join( \"output\")\n",
    "local_conf = {\n",
    "    \"input_folder\": input_folder,\n",
    "    \"output_folder\": output_folder,\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8579a083-bcf5-4ee1-95f0-af7e2d0c7389",
   "metadata": {},
   "source": [
    "### Setup runtime parameters for the transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "517c8fd7-4353-4646-9c76-913a43a01816",
   "metadata": {},
   "outputs": [],
   "source": [
    "params = {\n",
    "    # Data access. Only required parameters are specified\n",
    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
    "    \"data_files_to_use\": ast.literal_eval(\"['.pdf','.docx','.pptx','.zip']\"),\n",
    "    # execution info\n",
    "    \"runtime_pipeline_id\": \"pipeline_id\",\n",
    "    \"runtime_job_id\": \"job_id\",\n",
    "    # pdf2parquet params\n",
    "    \"pdf2parquet_double_precision\": 0,\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "348d002f-1458-48c7-987c-ba45de4cb2f2",
   "metadata": {},
   "source": [
    "## Step 2: Invoke Pdf2Parquet transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "93845ab6-0338-455e-9a84-c588326f7711",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "18:47:39 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 0}\n",
      "18:47:39 INFO - pipeline id pipeline_id\n",
      "18:47:39 INFO - code location None\n",
      "18:47:39 INFO - data factory data_ is using local data access: input_folder - input-data output_folder - output\n",
      "18:47:39 INFO - data factory data_ max_files -1, n_sample -1\n",
      "18:47:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']\n",
      "18:47:39 INFO - orchestrator pdf2parquet started at 2024-12-22 18:47:39\n",
      "18:47:39 INFO - Number of files is 1, source profile {'max_file_size': 0.03161430358886719, 'min_file_size': 0.03161430358886719, 'total_file_size': 0.03161430358886719}\n",
      "18:47:39 INFO - Initializing models\n",
      "18:47:45 INFO - Completed 1 files (100.0%) in 0.029 min\n",
      "18:47:45 INFO - Done processing 1 files, waiting for flush() completion.\n",
      "18:47:45 INFO - done flushing in 0.0 sec\n",
      "18:47:45 INFO - Completed execution in 0.092 min, execution result 0\n"
     ]
    }
   ],
   "source": [
    "%%capture\n",
    "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
    "launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())\n",
    "launcher.launch()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ed69398-1ab0-43d8-a796-f8898bf1dbf8",
   "metadata": {},
   "source": [
    "### Verify the input parquet created in output folder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "102fe0e7-68ac-4da8-9655-55a9233fc7bb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['output/Invoice.parquet', 'output/metadata.json']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "glob.glob(\"output/*\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51ac9479-d79c-4928-bb50-b0972a46b008",
   "metadata": {},
   "source": [
    "## Step 3: Import necessary PIIRedactor libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "31f4a49a-8ce5-4396-bf3c-1b06cab9aad1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pii_redactor_transform import doc_transformed_contents_cli_param\n",
    "from pii_redactor_transform_python import PIIRedactorPythonTransformConfiguration\n",
    "\n",
    "\n",
    "# create parameters\n",
    "input_folder = os.path.abspath(os.path.join(os.getcwd(), \"output\"))\n",
    "output_folder = os.path.abspath(os.path.join(os.getcwd(), \"output-redacted\"))\n",
    "local_conf = {\n",
    "    \"input_folder\": input_folder,\n",
    "    \"output_folder\": output_folder,\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07643473-5ca0-4392-bd1f-b140c551c0eb",
   "metadata": {},
   "source": [
    "## Step 4: Invoke PII Redactor configuration transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "24d847b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pii_redactor_transform import PIIRedactorTransform\n",
    "\n",
    "\n",
    "config = {\n",
    "    \"entities\": [\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"LOCATION\"],\n",
    "    \"operator\": \"replace\",\n",
    "    \"transformed_contents\": \"redacted_contents\",\n",
    "    \"score_threshold\": 0.7\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "3d3ef8a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "18:47:46 INFO - pipeline id pipeline_id\n",
      "18:47:46 INFO - code location None\n",
      "18:47:46 INFO - data factory data_ is using local data access: input_folder - /Users/poojaholkar/GSI/WATSONX/WATSONXDATA/DPK/GITHUBCOPY/poojalocalupdated/data-prep-kit/examples/notebooks/PII/output output_folder - /Users/poojaholkar/GSI/WATSONX/WATSONXDATA/DPK/GITHUBCOPY/poojalocalupdated/data-prep-kit/examples/notebooks/PII/output-redacted\n",
      "18:47:46 INFO - data factory data_ max_files -1, n_sample -1\n",
      "18:47:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
      "18:47:46 INFO - orchestrator pii_redactor started at 2024-12-22 18:47:46\n",
      "18:47:46 INFO - Number of files is 1, source profile {'max_file_size': 0.012392044067382812, 'min_file_size': 0.012392044067382812, 'total_file_size': 0.012392044067382812}\n",
      "18:47:46 INFO - Loading model from flair/ner-english-large\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-12-22 18:47:59,263 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "18:48:00 INFO - Completed 1 files (100.0%) in 0.007 min\n",
      "18:48:00 INFO - Done processing 1 files, waiting for flush() completion.\n",
      "18:48:00 INFO - done flushing in 0.0 sec\n",
      "18:48:00 INFO - Completed execution in 0.225 min, execution result 0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {\"pii_redactor_transformed_contents\": \"new_contents\",\"data_local_config\": local_conf}\n",
    "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
    "launcher = PythonTransformLauncher(runtime_config=PIIRedactorPythonTransformConfiguration())\n",
    "launcher.launch()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92d77a3e-37fe-40a5-92e8-6e8ec509ac73",
   "metadata": {},
   "source": [
    "### Step 5: Display Output in a Readable Format with masked PII information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ca040517-6c90-4f87-9eef-a618f77a1a6b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<ORGANIZATION>.\n",
      "\n",
      "Invoice Details:\n",
      "\n",
      "Invoice Number: INV-2024-001\n",
      "\n",
      "Invoice Date: November 15, 2024\n",
      "\n",
      "Invoice Date: November 15, 2024\n",
      "\n",
      "Due Date: November 30, 2024\n",
      "\n",
      "Billing Information:\n",
      "\n",
      "Customer Name: <PERSON>\n",
      "\n",
      "Customer Name: <PERSON>\n",
      "\n",
      "Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
      "\n",
      "Email: <EMAIL_ADDRESS>\n",
      "\n",
      "Phone: <PHONE_NUMBER>\n",
      "\n",
      "Shipping Information:\n",
      "\n",
      "Recipient Name: <PERSON>\n",
      "\n",
      "Recipient Name: <PERSON>\n",
      "\n",
      "Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
      "\n",
      "## Item Details:\n",
      "\n",
      "| Description               | Quantity   | Unit Price   | Total                               |\n",
      "|---------------------------|------------|--------------|-------------------------------------|\n",
      "| MacBook Air (13-inch, M2) | 1          | $999.00      | $999.00                             |\n",
      "| 1                         |            | $199.00      | <ORGANIZATION>+ for MacBook Air  $199.00 |\n",
      "\n",
      "## INVOICE\n",
      "\n",
      "Transaction ID: 9876543210ABCDE\n",
      "\n",
      "Notes:\n",
      "\n",
      "Thank you for your purchase!\n",
      "\n",
      "For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.\n",
      "\n",
      "Subtotal:\n",
      "\n",
      "$1,198.00\n",
      "\n",
      "Tax (8%):\n",
      "\n",
      "$95.84\n",
      "\n",
      "Total Amount Due:\n",
      "\n",
      "$1,293.84\n",
      "['ORGANIZATION' 'PERSON' 'PERSON' 'LOCATION' 'LOCATION' 'LOCATION'\n",
      " 'EMAIL_ADDRESS' 'PERSON' 'PERSON' 'LOCATION' 'LOCATION' 'LOCATION'\n",
      " 'EMAIL_ADDRESS' 'ORGANIZATION' 'PHONE_NUMBER' 'ORGANIZATION']\n"
     ]
    }
   ],
   "source": [
    "data = pd.read_parquet('output-redacted/Invoice.parquet')\n",
    "print(data[\"new_contents\"][0])\n",
    "print(data[\"detected_pii\"][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c69c356-df95-403f-ae6a-832d39ea0f81",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "\n",
    "### This notebook effectively demonstrates how to seamlessly apply redaction for PII entities"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "data-prep-kit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
